FIFA Player Data Analysis Project - Dimensionality Reduction and Player Skill Visualization¶
Project Goal¶
The main goal of this project is to analyze and visualize FIFA player skills, focusing exclusively on their on-field attributes and technical abilities. The project deliberately omits players' personal, physical, and economic data (such as height, weight, nationality, or market value) to concentrate on a pure analysis of football skills.
The project utilizes advanced dimensionality reduction techniques:
- PCA (Principal Component Analysis)
- Classical MDS (Multidimensional Scaling)
- Non-metric MDS
Plots¶
To view interactive plots, run the Jupyter Notebook (.ipynb) file or open the corresponding (.html) plot file.
Data¶
The "Football Players Data" dataset from Kaggle was used for the analysis:
https://www.kaggle.com/datasets/maso0dahmed/football-players-data
Scope of Analyzed Data¶
The analysis covers the following skill categories:
- Technical Skills:
- Dribbling, passing, shooting
- Ball control, accuracy
- Physical Attributes in the Context of Play:
- Acceleration, agility
- Shot power, stamina
- Mental Traits:
- Aggression, composure
- Vision, game intelligence
Traits not directly related to football skills were omitted:
- Personal data (first name, last name, date of birth)
- Physical parameters (height, weight)
- Club/national team data
- Economic aspects (value, wage, release clause)
- Preferential traits (preferred foot, body type)
This approach allows for an objective analysis of pure football skills, without the influence of external or marketing factors.
import plotly.io as pio
pio.renderers.default = "notebook_connected"
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
import plotly.graph_objects as go
from sklearn.manifold import MDS
from factor_analyzer import calculate_kmo
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.subplots as sp
from sklearn.metrics import pairwise_distances
from sklearn.metrics import euclidean_distances
from IPython.display import display, HTML
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
data = pd.read_csv("fifa_players.csv")
print("Basic data information:")
data.info()
Basic data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17954 entries, 0 to 17953
Data columns (total 51 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   name                           17954 non-null  object
 1   full_name                      17954 non-null  object
 2   birth_date                     17954 non-null  object
 3   age                            17954 non-null  int64
 4   height_cm                      17954 non-null  float64
 5   weight_kgs                     17954 non-null  float64
 6   positions                      17954 non-null  object
 7   nationality                    17954 non-null  object
 8   overall_rating                 17954 non-null  int64
 9   potential                      17954 non-null  int64
 10  value_euro                     17699 non-null  float64
 11  wage_euro                      17708 non-null  float64
 12  preferred_foot                 17954 non-null  object
 13  international_reputation(1-5)  17954 non-null  int64
 14  weak_foot(1-5)                 17954 non-null  int64
 15  skill_moves(1-5)               17954 non-null  int64
 16  body_type                      17954 non-null  object
 17  release_clause_euro            16117 non-null  float64
 18  national_team                  857 non-null    object
 19  national_rating                857 non-null    float64
 20  national_team_position         857 non-null    object
 21  national_jersey_number         857 non-null    float64
 22  crossing                       17954 non-null  int64
 23  finishing                      17954 non-null  int64
 24  heading_accuracy               17954 non-null  int64
 25  short_passing                  17954 non-null  int64
 26  volleys                        17954 non-null  int64
 27  dribbling                      17954 non-null  int64
 28  curve                          17954 non-null  int64
 29  freekick_accuracy              17954 non-null  int64
 30  long_passing                   17954 non-null  int64
 31  ball_control                   17954 non-null  int64
 32  acceleration                   17954 non-null  int64
 33  sprint_speed                   17954 non-null  int64
 34  agility                        17954 non-null  int64
 35  reactions                      17954 non-null  int64
 36  balance                        17954 non-null  int64
 37  shot_power                     17954 non-null  int64
 38  jumping                        17954 non-null  int64
 39  stamina                        17954 non-null  int64
 40  strength                       17954 non-null  int64
 41  long_shots                     17954 non-null  int64
 42  aggression                     17954 non-null  int64
 43  interceptions                  17954 non-null  int64
 44  positioning                    17954 non-null  int64
 45  vision                         17954 non-null  int64
 46  penalties                      17954 non-null  int64
 47  composure                      17954 non-null  int64
 48  marking                        17954 non-null  int64
 49  standing_tackle                17954 non-null  int64
 50  sliding_tackle                 17954 non-null  int64
dtypes: float64(7), int64(35), object(9)
memory usage: 7.0+ MB
The dataset is dominated by numerical variables, but there are also columns of type object.
Exploratory Data Analysis¶
print(data.duplicated().sum())
0
There are no duplicated rows, which simplifies further analysis.
data = data.drop(columns=['full_name', 'birth_date', 'positions', 'nationality','national_team',
'national_team_position', 'national_rating', 'national_jersey_number', 'value_euro',
'wage_euro','release_clause_euro','international_reputation(1-5)','age','height_cm', 'weight_kgs', 'preferred_foot', 'body_type'])
print("Missing values:")
print(data.isnull().sum().sort_values(ascending=False))
Missing values:
name                 0
aggression           0
balance              0
shot_power           0
jumping              0
stamina              0
strength             0
long_shots           0
interceptions        0
overall_rating       0
positioning          0
vision               0
penalties            0
composure            0
marking              0
standing_tackle      0
reactions            0
agility              0
sprint_speed         0
acceleration         0
ball_control         0
long_passing         0
freekick_accuracy    0
curve                0
dribbling            0
volleys              0
short_passing        0
heading_accuracy     0
finishing            0
crossing             0
skill_moves(1-5)     0
weak_foot(1-5)       0
potential            0
sliding_tackle       0
dtype: int64
Columns not directly related to on-field and technical skills were removed. The remaining data contain no missing values.
print(data.dtypes.value_counts())
data.head()
int64     33
object     1
Name: count, dtype: int64
| name | overall_rating | potential | weak_foot(1-5) | skill_moves(1-5) | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | freekick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | composure | marking | standing_tackle | sliding_tackle | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L. Messi | 94 | 94 | 4 | 4 | 86 | 95 | 70 | 92 | 86 | 97 | 93 | 94 | 89 | 96 | 91 | 86 | 93 | 95 | 95 | 85 | 68 | 72 | 66 | 94 | 48 | 22 | 94 | 94 | 75 | 96 | 33 | 28 | 26 |
| 1 | C. Eriksen | 88 | 89 | 5 | 4 | 88 | 81 | 52 | 91 | 80 | 84 | 86 | 87 | 89 | 91 | 76 | 73 | 80 | 88 | 81 | 84 | 50 | 92 | 58 | 89 | 46 | 56 | 84 | 91 | 67 | 88 | 59 | 57 | 22 |
| 2 | P. Pogba | 88 | 91 | 4 | 5 | 80 | 75 | 75 | 86 | 85 | 87 | 85 | 82 | 90 | 90 | 71 | 79 | 76 | 82 | 66 | 90 | 83 | 88 | 87 | 82 | 78 | 64 | 82 | 88 | 82 | 87 | 63 | 67 | 67 |
| 3 | L. Insigne | 88 | 88 | 4 | 4 | 86 | 77 | 56 | 85 | 74 | 90 | 87 | 77 | 78 | 93 | 94 | 86 | 94 | 83 | 93 | 75 | 53 | 75 | 44 | 84 | 34 | 26 | 83 | 87 | 61 | 83 | 51 | 24 | 22 |
| 4 | K. Koulibaly | 88 | 91 | 3 | 2 | 30 | 22 | 83 | 68 | 14 | 69 | 28 | 28 | 60 | 63 | 70 | 75 | 50 | 82 | 40 | 55 | 81 | 75 | 94 | 15 | 87 | 88 | 24 | 49 | 33 | 80 | 91 | 88 | 87 |
The data is in the appropriate formats for further numerical analysis, except for the name column, which is used only for player identification.
# Histograms of numerical variables
data.hist(bins=20, figsize=(20, 30), layout=(11, 4))
plt.show()
Most skills have a distribution close to normal, although some may be skewed (e.g., overall_rating and potential tend to concentrate at higher values).
# Box plots for numerical variables
numerical_columns = data.drop(columns=['weak_foot(1-5)', 'skill_moves(1-5)', 'name'])
plt.figure(figsize=(16, 30))
for i, column in enumerate(numerical_columns.columns, 1):
plt.subplot((len(numerical_columns.columns) + 5) // 6, 6, i)  # grid with 6 plots per row
sns.boxplot(y=numerical_columns[column])
plt.title(f'Chart for {column}')
plt.tight_layout()
plt.show()
Box plots confirm the presence of outliers in many numerical columns. Removing these outliers, as planned in later steps, is justified to obtain more stable results from the dimensionality reduction analysis.
data_no_name = data.drop(columns='name')
corr_matrix = data_no_name.corr()
plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix.round(1), annot=True, cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()
- Many on-field skills are strongly correlated with each other (e.g., dribbling with ball_control, short_passing with vision, marking with standing_tackle). This is expected as many skills are interrelated in football gameplay.
- overall_rating is strongly correlated with many skills, suggesting it is a good measure of a player's overall level.
- weak_foot(1-5) and skill_moves(1-5) have lower correlations with most other skills, which may suggest they are more independent traits.
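The strongly correlated pairs called out above can be listed programmatically by ranking the upper triangle of the correlation matrix. A minimal self-contained sketch on synthetic data (the column names and `demo` frame are illustrative; with the notebook's own frame you would rank `data_no_name.corr()` instead):

```python
import numpy as np
import pandas as pd

# Two columns driven by a shared signal, one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "dribbling": base + rng.normal(scale=0.3, size=200),
    "ball_control": base + rng.normal(scale=0.3, size=200),
    "strength": rng.normal(size=200),
})
corr = demo.corr()

# Keep each unordered pair once (strict upper triangle), sort by |r|.
pairs = (
    corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        .stack()
        .sort_values(key=np.abs, ascending=False)
)
print(pairs.head(5))  # dribbling/ball_control should top the list
```

The same three lines on the real correlation matrix recover pairs like dribbling/ball_control and marking/standing_tackle at the top of the ranking.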
Checking if variables are significantly correlated¶
# Calculating KMO coefficients
kmo_all, kmo_model = calculate_kmo(data_no_name)
kmo_table = pd.DataFrame({
'Variable': data_no_name.columns,
'KMO value': kmo_all
})
print("KMO Value for Each Variable:")
print(kmo_table.sort_values(by='KMO value', ascending=True))
print(f"\nKMO Value for the Entire Dataset: {kmo_model}")
KMO Value for Each Variable:
Variable KMO value
22 strength 0.837652
0 overall_rating 0.866812
1 potential 0.873593
31 standing_tackle 0.899356
20 jumping 0.899462
32 sliding_tackle 0.904767
17 reactions 0.916115
14 acceleration 0.933145
15 sprint_speed 0.934849
6 heading_accuracy 0.951463
18 balance 0.957829
12 long_passing 0.962806
25 interceptions 0.964330
7 short_passing 0.967629
5 finishing 0.973288
13 ball_control 0.973385
9 dribbling 0.974514
23 long_shots 0.974832
11 freekick_accuracy 0.975586
26 positioning 0.978110
10 curve 0.978915
27 vision 0.978959
16 agility 0.979186
19 shot_power 0.979933
4 crossing 0.980750
24 aggression 0.983179
29 composure 0.983556
30 marking 0.983834
28 penalties 0.983867
8 volleys 0.986962
21 stamina 0.988646
3 skill_moves(1-5) 0.992072
2 weak_foot(1-5) 0.996484
KMO Value for the Entire Dataset: 0.9651424518001596
All Kaiser-Meyer-Olkin values exceed 0.7 (overall KMO ≈ 0.965), indicating very good sampling adequacy: the variables share enough common variance for factor-style dimensionality reduction to be meaningful.
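For intuition, the KMO statistic compares summed squared correlations against summed squared partial correlations; values near 1 mean most shared variance is common rather than pairwise-specific. A minimal sketch of the overall KMO (a simplified re-derivation of the standard formula, not factor_analyzer's exact code path):

```python
import numpy as np

def overall_kmo(X):
    """Overall KMO: sum of squared correlations divided by that sum plus
    the sum of squared partial correlations (off-diagonal terms only)."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)                        # precision matrix
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                               # partial correlations
    np.fill_diagonal(corr, 0.0)
    np.fill_diagonal(partial, 0.0)
    r2 = (corr ** 2).sum()
    p2 = (partial ** 2).sum()
    return r2 / (r2 + p2)

# Variables driven by one shared factor should push KMO toward 1.
rng = np.random.default_rng(42)
shared = rng.normal(size=(500, 1))
X = shared + 0.5 * rng.normal(size=(500, 4))
print(overall_kmo(X))
```

Applied to data like the notebook's skill columns, which share a large common component, this measure lands well above the conventional 0.7 threshold.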
Removing outliers to obtain more stable, consistent, and interpretable analysis results¶
numerical_columns = data_no_name.drop(columns=['weak_foot(1-5)', 'skill_moves(1-5)'])
data_to_check = numerical_columns
Q1 = data_to_check.quantile(0.25)
Q3 = data_to_check.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = ((data_to_check < lower_bound) | (data_to_check > upper_bound))
data_no_outliers = data[~outliers.any(axis=1)].copy()  # copy() so later in-place edits do not trigger SettingWithCopyWarning
print(f"Shape of data after removing outliers: {data_no_outliers.shape}")
data = data_no_outliers
print("Data after removing outliers:")
data.reset_index(drop=True, inplace=True)
data.info()
Shape of data after removing outliers: (14979, 34)
Data after removing outliers:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14979 entries, 0 to 14978
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   name               14979 non-null  object
 1   overall_rating     14979 non-null  int64
 2   potential          14979 non-null  int64
 3   weak_foot(1-5)     14979 non-null  int64
 4   skill_moves(1-5)   14979 non-null  int64
 5   crossing           14979 non-null  int64
 6   finishing          14979 non-null  int64
 7   heading_accuracy   14979 non-null  int64
 8   short_passing      14979 non-null  int64
 9   volleys            14979 non-null  int64
 10  dribbling          14979 non-null  int64
 11  curve              14979 non-null  int64
 12  freekick_accuracy  14979 non-null  int64
 13  long_passing       14979 non-null  int64
 14  ball_control       14979 non-null  int64
 15  acceleration       14979 non-null  int64
 16  sprint_speed       14979 non-null  int64
 17  agility            14979 non-null  int64
 18  reactions          14979 non-null  int64
 19  balance            14979 non-null  int64
 20  shot_power         14979 non-null  int64
 21  jumping            14979 non-null  int64
 22  stamina            14979 non-null  int64
 23  strength           14979 non-null  int64
 24  long_shots         14979 non-null  int64
 25  aggression         14979 non-null  int64
 26  interceptions      14979 non-null  int64
 27  positioning        14979 non-null  int64
 28  vision             14979 non-null  int64
 29  penalties          14979 non-null  int64
 30  composure          14979 non-null  int64
 31  marking            14979 non-null  int64
 32  standing_tackle    14979 non-null  int64
 33  sliding_tackle     14979 non-null  int64
dtypes: int64(33), object(1)
memory usage: 3.9+ MB
Standardization of variables to ensure all data are on the same scale, which prevents variables with larger variance from dominating the analysis¶
columns_to_standardize = data_no_name.columns
scaler = StandardScaler()
data[columns_to_standardize] = scaler.fit_transform(data[columns_to_standardize])
data.head()
| name | overall_rating | potential | weak_foot(1-5) | skill_moves(1-5) | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | freekick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | composure | marking | standing_tackle | sliding_tackle | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H. Ziyech | 2.714519 | 2.484214 | -1.584602 | 2.403249 | 2.133340 | 1.126017 | -0.375498 | 2.506459 | 2.076935 | 1.941640 | 2.592602 | 2.121335 | 2.719777 | 2.086824 | 0.685890 | 0.508981 | 1.653158 | 2.478350 | 0.979633 | 1.555402 | -2.708322 | 1.718042 | -0.606598 | 1.581430 | 0.855658 | 0.606882 | 1.458286 | 2.583464 | 1.608283 | 2.294305 | 0.033842 | 0.724156 | -0.734775 |
| 1 | T. Werner | 2.714519 | 2.656635 | 1.561290 | 0.766220 | 1.050105 | 2.300775 | 1.747288 | 1.533099 | 2.625709 | 1.857762 | 0.781303 | 0.645802 | -0.108890 | 2.193150 | 2.219166 | 2.278134 | 1.227499 | 2.837425 | 0.807428 | 1.402184 | 1.086436 | 1.436045 | -0.183542 | 1.194589 | 0.218885 | -0.731041 | 2.084700 | 0.431368 | 1.608283 | 1.374401 | -0.721882 | -1.694755 | -1.628873 |
| 2 | Alex Telles | 2.714519 | 2.484214 | -1.584602 | 0.766220 | 2.494418 | 0.940529 | 1.481940 | 2.182006 | 1.390967 | 1.773883 | 2.391347 | 2.188405 | 2.470188 | 1.980498 | 1.407432 | 1.533227 | 0.546444 | 2.478350 | 0.290813 | 1.555402 | 1.371043 | 2.282035 | 0.154903 | 1.516956 | 1.280174 | 1.570187 | 1.458286 | 1.467563 | 1.689031 | 1.987670 | 1.661555 | 1.512931 | 1.526768 |
| 3 | Jorginho | 2.559392 | 2.139372 | -0.011656 | 0.766220 | 1.483399 | 0.445895 | -0.110150 | 2.830913 | 1.665354 | 1.773883 | 1.586325 | 1.987195 | 2.553385 | 2.299475 | 0.144733 | -0.701492 | 0.886971 | 2.358659 | 0.549121 | 0.865922 | -0.810943 | 1.060050 | -0.014319 | 0.678802 | 1.350927 | 1.356119 | 0.901474 | 2.503757 | 2.577266 | 1.885459 | 0.615168 | 1.302591 | 1.053422 |
| 4 | E. Višća | 2.559392 | 1.966951 | 1.561290 | 2.403249 | 1.844477 | 1.867970 | -1.083093 | 2.073855 | 1.871145 | 1.773883 | 1.519240 | 0.914080 | 1.222247 | 1.980498 | 1.948588 | 1.905681 | 2.249082 | 2.119275 | 1.840657 | 1.478793 | 0.232615 | 1.530044 | -0.691209 | 1.645903 | -0.913157 | 0.071713 | 1.945497 | 1.706684 | 1.850529 | 2.192094 | 0.208240 | -0.643055 | -0.524399 |
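A quick sanity check of what StandardScaler guarantees (per-column zero mean and unit variance, so no single skill dominates distance-based methods), shown on synthetic ratings-like data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic columns on a FIFA-like 0-100 rating scale.
rng = np.random.default_rng(7)
X = rng.normal(loc=50.0, scale=12.0, size=(1000, 3))

Z = StandardScaler().fit_transform(X)

# Each column now has (approximately) zero mean and unit variance.
print(Z.mean(axis=0).round(6))
print(Z.std(axis=0).round(6))
```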
numerical_columns = data.drop(columns=['weak_foot(1-5)', 'skill_moves(1-5)', 'name'])
plt.figure(figsize=(18, 26))
for i, column in enumerate(numerical_columns.columns, 1):
plt.subplot((len(numerical_columns.columns) + 5) // 6, 6, i)  # grid with 6 plots per row
sns.boxplot(y=numerical_columns[column])
plt.title(f'Boxplot of {column}')
plt.tight_layout()
plt.show()
PCA¶
Description
Principal Component Analysis (PCA) is a dimensionality reduction method that transforms data in such a way as to maximize variance along successive axes (principal components). The result is a set of new, orthogonal axes that can be used to analyze data with a reduced number of dimensions.
How it works
- PCA computes the covariance matrix of the data.
- Eigenvalues and eigenvectors of the covariance matrix are then calculated.
- The data are transformed onto the axes with the highest variance.
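The three steps above can be sketched directly in NumPy and checked against scikit-learn (a didactic sketch; sklearn's PCA actually uses an SVD, which is numerically equivalent for this purpose):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated features

# 1. Covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigenvalues/eigenvectors (eigh, since cov is symmetric), sorted descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project the data onto the axes of highest variance.
scores = Xc @ eigvecs

# The explained-variance ratios agree with sklearn's PCA.
ratio_manual = eigvals / eigvals.sum()
ratio_sklearn = PCA().fit(X).explained_variance_ratio_
print(np.allclose(ratio_manual, ratio_sklearn))  # True
```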
Advantages
- Simplicity and computational efficiency: Fast to implement, especially for large datasets.
- Information preservation: Reduces dimensionality while retaining as much variance as possible.
- Versatility: Used in image analysis, genetic data, finance, and many other fields.
Disadvantages
- Linearity assumption: PCA works best for data with linear structures.
- Sensitivity to scaling: Data must be standardized; otherwise, variables with large scales will dominate.
- Lack of intuitive interpretation: Principal components are linear combinations of variables, which can make interpretation difficult.
data_with_name = data.copy()
data = data.drop(columns='name')
pca = PCA()
pca_transformed = pca.fit_transform(data)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
explained_variance_table = pd.DataFrame({
"Component": [f"PC{i+1}" for i in range(len(explained_variance_ratio))],
"Explained Variance": explained_variance_ratio,
"Cumulative Variance": cumulative_variance
})
print(explained_variance_table)
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.title("Cumulative proportion of variance explained by PCA")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative variance")
plt.grid()
plt.show()
   Component  Explained Variance  Cumulative Variance
0        PC1            0.373210             0.373210
1        PC2            0.200211             0.573422
2        PC3            0.089520             0.662942
3        PC4            0.055820             0.718761
4        PC5            0.034537             0.753298
5        PC6            0.027262             0.780560
6        PC7            0.023706             0.804265
7        PC8            0.017762             0.822028
8        PC9            0.015249             0.837277
9       PC10            0.014827             0.852104
10      PC11            0.013365             0.865469
11      PC12            0.012862             0.878331
12      PC13            0.011945             0.890277
13      PC14            0.010237             0.900514
14      PC15            0.009611             0.910124
15      PC16            0.008579             0.918703
16      PC17            0.008171             0.926874
17      PC18            0.007634             0.934509
18      PC19            0.007520             0.942029
19      PC20            0.006704             0.948732
20      PC21            0.006544             0.955276
21      PC22            0.006009             0.961285
22      PC23            0.005479             0.966764
23      PC24            0.004879             0.971644
24      PC25            0.004679             0.976323
25      PC26            0.004417             0.980739
26      PC27            0.003733             0.984473
27      PC28            0.003509             0.987981
28      PC29            0.003326             0.991308
29      PC30            0.002808             0.994115
30      PC31            0.002529             0.996644
31      PC32            0.002451             0.999095
32      PC33            0.000905             1.000000
Reducing the data to 6 dimensions is a reasonable choice.
These 6 components explain about 78% of the total variance (the 80% mark is reached at 7 components), which is a substantial compression of the original 33 features.
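Choosing the number of components for a target variance threshold can be automated. The sketch below uses the explained-variance figures printed above (rounded to four decimals):

```python
import numpy as np

# Per-component explained-variance ratios (PC1..PC7 from the table above).
explained = np.array([0.3732, 0.2002, 0.0895, 0.0558, 0.0345, 0.0273, 0.0237])
cumulative = np.cumsum(explained)

def n_components_for(threshold, cumulative):
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

print(n_components_for(0.75, cumulative))  # 5 components pass 75%
print(n_components_for(0.80, cumulative))  # 80% needs 7 components
```

The same one-liner on the full `cumulative_variance` array from the PCA cell reproduces the trade-off discussed here.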
loadings = pca.components_.T
loadings_df = pd.DataFrame(loadings, columns=[f'PC{i+1}' for i in range(loadings.shape[1])], index=data.columns)
print("Factor Loadings (Eigenvectors):")
display(loadings_df.iloc[:, :10])
Factor Loadings (Eigenvectors):
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| overall_rating | 0.189898 | 0.222089 | 0.097214 | 0.165908 | 0.210745 | -0.006888 | 0.039548 | 0.021284 | -0.106415 | -0.097550 |
| potential | 0.122040 | 0.104578 | 0.002085 | 0.157593 | 0.651787 | -0.008396 | 0.114058 | 0.172163 | 0.449053 | -0.033428 |
| weak_foot(1-5) | 0.097084 | -0.024616 | 0.032603 | -0.009779 | 0.037442 | 0.946160 | -0.290813 | 0.054680 | 0.004079 | -0.001406 |
| skill_moves(1-5) | 0.217889 | -0.031715 | -0.044146 | 0.022430 | 0.199876 | -0.013300 | -0.001946 | 0.044601 | -0.362053 | -0.038879 |
| crossing | 0.207258 | 0.061363 | -0.206870 | -0.093966 | -0.100534 | -0.072993 | -0.121276 | 0.173366 | -0.366625 | 0.154706 |
| finishing | 0.225211 | -0.142942 | 0.183576 | 0.026342 | -0.052874 | -0.031064 | -0.030661 | -0.077279 | 0.115931 | -0.033831 |
| heading_accuracy | -0.000204 | 0.202776 | 0.353460 | 0.250195 | 0.042464 | -0.012272 | 0.069931 | 0.119175 | -0.073333 | 0.041270 |
| short_passing | 0.209369 | 0.175842 | -0.049048 | -0.147193 | 0.131935 | 0.003644 | 0.037958 | -0.264946 | 0.087424 | 0.258561 |
| volleys | 0.229245 | -0.078735 | 0.180136 | -0.023471 | -0.103722 | -0.015288 | 0.034454 | 0.120590 | 0.060046 | -0.086172 |
| dribbling | 0.253183 | -0.043294 | -0.085949 | 0.005102 | 0.084538 | -0.062030 | -0.060536 | -0.085597 | -0.118803 | 0.123439 |
| curve | 0.238886 | 0.003263 | -0.036723 | -0.151297 | -0.131930 | -0.032705 | 0.008588 | 0.303123 | -0.110820 | 0.023229 |
| freekick_accuracy | 0.212711 | 0.027185 | 0.002251 | -0.253559 | -0.197931 | 0.002038 | 0.051489 | 0.386823 | 0.026521 | -0.041093 |
| long_passing | 0.164786 | 0.205193 | -0.114389 | -0.256203 | 0.026020 | 0.013607 | 0.021647 | -0.223648 | 0.146197 | 0.380880 |
| ball_control | 0.255784 | 0.060594 | -0.006268 | -0.018430 | 0.140524 | -0.022629 | 0.002546 | -0.158509 | -0.082846 | 0.116276 |
| acceleration | 0.138501 | -0.138224 | -0.304121 | 0.345035 | 0.027815 | -0.058328 | -0.157062 | 0.191332 | 0.040225 | 0.097081 |
| sprint_speed | 0.125152 | -0.121592 | -0.259769 | 0.404012 | 0.038934 | -0.093929 | -0.254909 | 0.232427 | 0.045104 | 0.185666 |
| agility | 0.183857 | -0.106357 | -0.279665 | 0.171424 | -0.095473 | 0.043161 | 0.119436 | -0.073781 | -0.026966 | -0.057500 |
| reactions | 0.178046 | 0.207681 | 0.084273 | 0.119410 | 0.110487 | 0.006721 | 0.099846 | -0.089572 | -0.212717 | -0.339415 |
| balance | 0.134547 | -0.108123 | -0.342453 | 0.028387 | -0.110837 | 0.121638 | 0.401240 | -0.097274 | 0.129253 | -0.317632 |
| shot_power | 0.221804 | 0.017762 | 0.193568 | -0.026135 | -0.129456 | -0.060017 | -0.035677 | 0.149328 | 0.366314 | 0.106184 |
| jumping | 0.002160 | 0.125590 | 0.050272 | 0.428606 | -0.320096 | 0.201334 | 0.576094 | -0.024110 | -0.025258 | 0.399388 |
| stamina | 0.101496 | 0.169318 | -0.120961 | 0.270132 | -0.299611 | -0.069627 | -0.360607 | -0.368103 | 0.246412 | -0.330740 |
| strength | -0.033695 | 0.215302 | 0.321673 | 0.211474 | -0.111876 | -0.070500 | -0.312858 | -0.015055 | -0.105214 | 0.202918 |
| long_shots | 0.243582 | -0.029713 | 0.125039 | -0.082631 | -0.166592 | -0.039008 | -0.031908 | 0.040039 | 0.260143 | 0.036716 |
| aggression | 0.013986 | 0.303541 | 0.046471 | 0.092872 | -0.206676 | -0.002097 | -0.028371 | 0.036357 | 0.067366 | -0.104563 |
| interceptions | -0.038974 | 0.347487 | -0.158131 | -0.065788 | -0.065597 | 0.012295 | -0.012313 | 0.089108 | 0.015050 | -0.075415 |
| positioning | 0.239290 | -0.084911 | 0.058372 | 0.038727 | -0.087915 | -0.076639 | -0.093992 | -0.157456 | -0.089798 | -0.057098 |
| vision | 0.242493 | 0.021382 | -0.017240 | -0.172848 | -0.029167 | 0.003266 | -0.002752 | -0.269411 | 0.058484 | 0.101481 |
| penalties | 0.198864 | -0.082207 | 0.227309 | -0.051340 | -0.100543 | -0.014058 | 0.105514 | 0.232278 | 0.077886 | -0.230535 |
| composure | 0.192121 | 0.178066 | 0.113956 | 0.035761 | 0.098337 | 0.038604 | 0.107806 | -0.106667 | -0.257745 | -0.190243 |
| marking | -0.047726 | 0.331611 | -0.158237 | -0.052812 | -0.019080 | 0.006491 | -0.008921 | 0.117631 | 0.051565 | -0.063872 |
| standing_tackle | -0.059788 | 0.341337 | -0.180202 | -0.082049 | -0.028068 | 0.000934 | -0.031103 | 0.141261 | 0.046079 | -0.029911 |
| sliding_tackle | -0.067648 | 0.331331 | -0.200135 | -0.078208 | -0.026480 | -0.001080 | -0.015825 | 0.162620 | 0.044290 | -0.015829 |
Analysis of Feature Influence on Principal Components¶
Based on the factor loadings (eigenvectors) for the first 10 principal components:
PC1 (37.32% of variance)¶
High loadings: ball_control, dribbling, long_shots, vision, positioning, curve, volleys, finishing, shot_power, skill_moves(1-5), freekick_accuracy, short_passing, crossing.
Interpretation: PC1 focuses mainly on the technical quality needed by players operating primarily in the opponent's half.
PC2 (20.02% of variance)¶
High positive loadings: interceptions, standing_tackle, marking, sliding_tackle, aggression, overall_rating, strength, reactions, long_passing, heading_accuracy.
High negative loadings: finishing, acceleration, sprint_speed, agility, balance.
Interpretation: This component is primarily related to defensive abilities, physical strength, and tackling. Players with high PC2 scores are strong defensively, while those with low scores tend to be faster and more agile.
PC3 (8.95% of variance)¶
High positive loadings: heading_accuracy, strength, penalties, shot_power, finishing, volleys.
High negative loadings: balance, acceleration, agility, sprint_speed, crossing, sliding_tackle.
Interpretation: PC3 captures characteristics typical of a striker.
PC4 (5.58% of variance)¶
High positive loadings: jumping, sprint_speed, acceleration, heading_accuracy, stamina, strength.
High negative loadings: freekick_accuracy, long_passing, vision, curve, long_shots, crossing, penalties.
Interpretation: This component highlights athletic traits like jumping and speed, while showing negative influence from technical passing and shooting skills.
PC5 (3.45% of variance)¶
High positive loadings: potential, overall_rating, skill_moves.
Interpretation: These are the key summary ratings for assessing a player's overall ability.
PC6 (2.73% of variance)¶
High positive loadings: weak_foot(1-5)
Interpretation: Component focused on the quality of the weaker foot.
PC7 and higher¶
These components explain smaller portions of the variance and represent more nuanced combinations of skills.
# Scale the loadings by the square root of the eigenvalues; for standardized
# data these are the correlations between variables and components
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loadings_df = pd.DataFrame(loadings[:, :2], columns=['PC1', 'PC2'], index=data.columns)
print(loadings_df)
plt.figure(figsize=(10, 8))
plt.scatter(pca_transformed[:, 0], pca_transformed[:, 1], alpha=0.5, label="Observations")
scale_factor = 3 # Scaling factor for better vector visibility
for i, var in enumerate(data.columns):
plt.arrow(
0, 0,
loadings[i, 0] * scale_factor, loadings[i, 1] * scale_factor,
color="r", alpha=0.5, head_width=0.5, linewidth=2
)
plt.text(
loadings[i, 0] * scale_factor * 1.15,
loadings[i, 1] * scale_factor * 1.15,
var, color="black", fontsize=12
)
plt.title("Biplot: PCA (PC1 vs PC2)")
plt.xlabel("First Principal Component (PC1)")
plt.ylabel("Second Principal Component (PC2)")
plt.axhline(0, color="grey", linestyle="--", linewidth=0.7)
plt.axvline(0, color="grey", linestyle="--", linewidth=0.7)
plt.legend()
plt.grid()
plt.show()
                        PC1       PC2
overall_rating     0.666452  0.570878
potential          0.428303  0.268817
weak_foot(1-5)     0.340719 -0.063274
skill_moves(1-5)   0.764686 -0.081524
crossing           0.727376  0.157733
finishing          0.790384 -0.367430
heading_accuracy  -0.000716  0.521234
short_passing      0.734785  0.452000
volleys            0.804539 -0.202387
dribbling          0.888551 -0.111287
curve              0.838375  0.008387
freekick_accuracy  0.746516  0.069878
long_passing       0.578319  0.527448
ball_control       0.897679  0.155755
acceleration       0.486072 -0.355305
sprint_speed       0.439225 -0.312551
agility            0.645251 -0.273390
reactions          0.624855  0.533842
balance            0.472198 -0.277930
shot_power         0.778428  0.045657
jumping            0.007580  0.322827
stamina            0.356203  0.435231
strength          -0.118252  0.553431
long_shots         0.854857 -0.076378
aggression         0.049084  0.780248
interceptions     -0.136781  0.893213
positioning        0.839794 -0.218262
vision             0.851036  0.054963
penalties          0.697919 -0.211312
composure          0.674254  0.457718
marking           -0.167495  0.852402
standing_tackle   -0.209829  0.877403
sliding_tackle    -0.237412  0.851683
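For standardized inputs, loadings scaled by the square root of the eigenvalues equal the Pearson correlations between each variable and the component scores, which is what makes the biplot arrows interpretable. A small verification sketch on synthetic data (`pca_demo` is local to the example and does not touch the notebook's `pca`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(
    rng.normal(size=(400, 4)) @ rng.normal(size=(4, 4))
)

pca_demo = PCA().fit(X)
scores = pca_demo.transform(X)

# Loadings scaled by sqrt(eigenvalue), as in the cell above.
scaled = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)

# Correlation of variable j with component k, computed directly.
direct = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1]
                    for k in range(4)] for j in range(4)])
print(np.allclose(scaled, direct, atol=1e-2))  # True up to ddof rounding
```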
Analysis of the Plot¶
Main Clusters of Variables in PCA¶
PC1 (Offensive Skills)¶
- High values for skills typical of offensive positions (finishing, crossing, curve, vision)
- Represents technical abilities required by midfielders and forwards
- Indicates traits related to creative ball play
PC2 (Defensive Skills)¶
- Dominated by attributes key for defenders (strength, heading_accuracy, interceptions)
- Captures physical and tactical skills needed in defensive play
- Reflects abilities useful for ball recovery and duels
Position Distribution¶
- Bottom-right: Forwards (high PC1, low PC2)
- Top-right: Midfielders (high PC1 and PC2)
- Top area: Defenders (high PC2)
Main Analysis¶
In the next section, the full analysis is repeated on a 10% random sample of the data: MDS operates on the complete pairwise-distance matrix, so its memory use and runtime grow quadratically with the number of players.
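To see why sampling matters, consider the size of a dense pairwise-distance matrix over all players; a back-of-envelope sketch (row counts taken from this notebook):

```python
# A dense float64 distance matrix needs n * n * 8 bytes.
def distance_matrix_mb(n_rows):
    return n_rows * n_rows * 8 / 1024**2

full = 14979          # rows after outlier removal
sample = full // 10   # roughly the 10% sample used below

print(f"full data:  {distance_matrix_mb(full):8.0f} MB")   # ~1700 MB
print(f"10% sample: {distance_matrix_mb(sample):8.0f} MB")  # ~17 MB
```

The hundredfold reduction in matrix size (and a comparable drop in MDS iteration cost) is what makes the non-metric MDS runs below tractable.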
data_sample_with_name = data_with_name.sample(frac=0.1, random_state=0)
data_sample_with_name.reset_index(drop=True, inplace=True)
data2 = pd.read_csv("fifa_players.csv")
positions_df = data2[['name', 'positions']]
# Note: merging on 'name' alone duplicates rows for players who share a name
# (e.g. the two different 'A. Bakayoko' entries visible below)
data_sample_with_name = data_sample_with_name.merge(positions_df,
                                                    on='name',
                                                    how='inner')
cols = data_sample_with_name.columns.tolist()
cols.remove('positions')
cols.insert(1, 'positions')
data_sample_with_name = data_sample_with_name[cols]
data_sample = data_sample_with_name.drop(columns=['name', 'positions'])
position_map = {
'CB': 'Center Back',
'GK': 'Goalkeeper',
'CM': 'Central Midfielder',
'CDM': 'Defensive Midfielder',
'CAM': 'Attacking Midfielder',
'CF': 'Center Forward',
'ST': 'Striker',
'RW': 'Right Winger',
'LW': 'Left Winger',
'RM': 'Right Midfielder',
'LM': 'Left Midfielder',
'RB': 'Right Back',
'LB': 'Left Back'
}
def replace_positions(position_str):
positions = position_str.split(',')
full_positions = [position_map.get(pos, pos) for pos in positions]
return ', '.join(full_positions)
data_sample_with_name['positions'] = data_sample_with_name['positions'].apply(replace_positions)
data_sample_with_name.head()
| name | positions | overall_rating | potential | weak_foot(1-5) | skill_moves(1-5) | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | freekick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | composure | marking | standing_tackle | sliding_tackle | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M. Çağıran | Defensive Midfielder, Central Midfielder | 1.163256 | 1.104846 | 1.561290 | 0.766220 | 1.266752 | 0.816871 | 1.658838 | 1.424947 | 0.362016 | 1.102854 | 1.317984 | 1.853056 | 1.721424 | 1.129891 | 0.054541 | -0.049699 | 0.205916 | 1.281434 | -0.484108 | 1.555402 | 1.465912 | 0.872052 | 1.001016 | 1.452483 | 0.997164 | 1.195569 | 0.692669 | 1.387855 | 1.123791 | 1.374401 | 1.022096 | 0.987081 | 0.790451 |
| 1 | J. Orejuela | Defensive Midfielder, Central Midfielder | 0.387624 | 0.242741 | -0.011656 | -0.870809 | -0.249778 | 0.074918 | 0.597445 | 0.992343 | -1.970273 | -0.490840 | -0.359145 | 1.115289 | 1.055855 | -0.039694 | 0.054541 | -0.235926 | -0.390007 | 0.922359 | 0.290813 | 0.942531 | -1.095550 | 1.342046 | 0.239514 | 0.485382 | 0.006627 | 0.874467 | -0.490556 | 0.351661 | 0.235556 | 0.863343 | 0.789566 | 0.881911 | 0.737857 |
| 2 | A. Bakayoko | Striker | -0.698260 | 0.587583 | -0.011656 | 0.766220 | 0.544595 | 0.013089 | -0.640846 | -0.089169 | 0.362016 | 0.767339 | 1.116729 | 0.981150 | 0.140699 | 0.704587 | 0.415312 | 0.508981 | 0.376180 | -1.232092 | -0.053596 | 0.636096 | -0.905812 | -1.195924 | -1.791156 | -0.417246 | -1.196167 | -0.302905 | -0.490556 | -0.285997 | 0.962294 | -0.874254 | 0.324505 | 0.303476 | 0.159323 |
| 3 | A. Bakayoko | Right Winger, Right Back | -0.698260 | 0.587583 | -0.011656 | 0.766220 | 0.544595 | 0.013089 | -0.640846 | -0.089169 | 0.362016 | 0.767339 | 1.116729 | 0.981150 | 0.140699 | 0.704587 | 0.415312 | 0.508981 | 0.376180 | -1.232092 | -0.053596 | 0.636096 | -0.905812 | -1.195924 | -1.791156 | -0.417246 | -1.196167 | -0.302905 | -0.490556 | -0.285997 | 0.962294 | -0.874254 | 0.324505 | 0.303476 | 0.159323 |
| 4 | Kim Hyo Gi | Striker | -0.698260 | -1.653890 | -1.584602 | 0.766220 | -1.982954 | 0.940529 | 0.066749 | -1.495135 | 0.430613 | -0.239205 | -1.298337 | -1.098010 | -2.521576 | -0.146020 | -0.486616 | -0.422152 | -0.390007 | -0.873016 | -0.225801 | 0.253051 | -1.190419 | -1.007926 | -0.268153 | -0.159352 | -1.762188 | -2.122481 | 0.901474 | -1.003362 | 0.962294 | -0.260984 | -1.477606 | -1.799925 | -1.997031 |
pca_sample = PCA()
pca_sample_transformed = pca_sample.fit_transform(data_sample)
loadings = pca_sample.components_.T
explained_variance_ratio_sample = pca_sample.explained_variance_ratio_
cumulative_variance_sample = np.cumsum(explained_variance_ratio_sample)
explained_variance_table_sample = pd.DataFrame({
    "Component": [f"PC{i+1}" for i in range(len(explained_variance_ratio_sample))],
    "Explained Variance": explained_variance_ratio_sample,
    "Cumulative Variance": cumulative_variance_sample
})
print(explained_variance_table_sample)
plt.figure(figsize=(8, 8))
plt.axhline(0, color='gray', linestyle='--', linewidth=0.5)
plt.axvline(0, color='gray', linestyle='--', linewidth=0.5)
for i, var in enumerate(data_sample.columns):
    plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], color='b', alpha=0.7)
    plt.text(loadings[i, 0] * 1.1, loadings[i, 1] * 1.1, var, color='r', ha='center', va='center')
plt.xlabel(f"PC1 ({pca_sample.explained_variance_ratio_[0]:.2%})")
plt.ylabel(f"PC2 ({pca_sample.explained_variance_ratio_[1]:.2%})")
plt.title("Plot of factor loadings")
plt.grid()
plt.show()
position_colors = {
    'Goalkeeper': 'green',
    'Center Back': 'blue',             # Defensive group
    'Right Back': 'blue',              # Defensive group
    'Left Back': 'blue',               # Defensive group
    'Defensive Midfielder': 'purple',  # Defensive midfielder group
    'Central Midfielder': 'red',       # Central midfielder group
    'Attacking Midfielder': 'orange',  # Attacking midfielder group
    'Right Midfielder': 'orange',      # Attacking midfielder group
    'Left Midfielder': 'orange',       # Attacking midfielder group
    'Right Winger': 'yellow',          # Winger group
    'Left Winger': 'yellow',           # Winger group
    'Center Forward': 'brown',         # Striker group
    'Striker': 'brown'                 # Striker group
}
def get_color(position_str):
    first_position = position_str.split(',')[0].strip()
    return position_colors.get(first_position, 'grey')
fig_pca = go.Figure()
data_sample_positions = data_sample_with_name['positions'][:len(pca_sample_transformed)]
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_positions]
    if any(position_mask):
        fig_pca.add_trace(
            go.Scatter(
                x=pca_sample_transformed[position_mask, 0],
                y=pca_sample_transformed[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>PC1: {pc1:.2f}<br>PC2: {pc2:.2f}"
                      for name, pos, pc1, pc2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          pca_sample_transformed[position_mask, 0],
                          pca_sample_transformed[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
fig_pca.update_layout(
    title="Interactive PCA plot with position filtering",
    xaxis_title="PC1",
    yaxis_title="PC2",
    showlegend=True,
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
    xaxis=dict(
        range=[pca_sample_transformed[:, 0].min(), pca_sample_transformed[:, 0].max()],
        scaleanchor="y"
    ),
    yaxis=dict(
        range=[pca_sample_transformed[:, 1].min(), pca_sample_transformed[:, 1].max()]
    )
)
fig_pca.show()
   Component  Explained Variance  Cumulative Variance
0        PC1            0.373898             0.373898
1        PC2            0.201186             0.575084
2        PC3            0.085924             0.661008
3        PC4            0.056892             0.717900
4        PC5            0.034665             0.752566
5        PC6            0.029175             0.781740
6        PC7            0.024213             0.805953
7        PC8            0.017875             0.823828
8        PC9            0.015256             0.839084
9       PC10            0.014772             0.853855
10      PC11            0.013173             0.867029
11      PC12            0.012686             0.879714
12      PC13            0.011677             0.891391
13      PC14            0.010659             0.902050
14      PC15            0.010104             0.912154
15      PC16            0.008776             0.920930
16      PC17            0.008121             0.929052
17      PC18            0.007866             0.936917
18      PC19            0.007250             0.944167
19      PC20            0.006617             0.950784
20      PC21            0.006496             0.957280
21      PC22            0.005908             0.963188
22      PC23            0.005046             0.968233
23      PC24            0.004603             0.972836
24      PC25            0.004416             0.977252
25      PC26            0.004163             0.981414
26      PC27            0.003583             0.984997
27      PC28            0.003389             0.988385
28      PC29            0.003259             0.991644
29      PC30            0.002738             0.994383
30      PC31            0.002442             0.996825
31      PC32            0.002321             0.999146
32      PC33            0.000854             1.000000
Interpretation of Skill Influence on Player Positions¶
The analysis of the plot indicates that player positions often align with their skill profiles, which can help coaches match player abilities to suitable positions.
Key Observations:
- Many players have skills better suited to positions other than those they currently play.
- The overall distribution of players across positions suggests that certain attributes are more significant for specific positions.
Practical Applications:
- Coaches can use this information to experiment with alternative positions for players.
- Helps identify "hidden potential" – skills that may be better suited to different roles on the field.
- Supports player development planning by focusing training on positions aligned with their skill profiles.
Added Value:
This analysis can aid in optimizing team formation and better utilizing individual player strengths, even if it requires deviating from traditional positions.
Interpretation Challenge:
With only two dimensions, individual players are hard to interpret precisely, because PC1 and PC2 together explain only about 58% of the variance. They do, however, explain enough to conclude that players with negative scores on both components are below average, representing lower-tier players.
Position Distribution by Plot Area:
- Positive PC1, Positive PC2: Defensive Midfielder
- Positive PC1, Negative PC2: Central Midfielder, Attacking Midfielder, Right Midfielder, Left Midfielder, Right Winger, Left Winger, Center Forward, Striker
- Negative PC1, Negative PC2: Goalkeeper
- Negative PC1, Positive PC2: Center Back
- Center of the plot: Right Back, Left Back
The plot analysis partially confirms conclusions from the previous visualization, although there was a misinterpretation regarding midfielder attributes. From this plot, it appears that only defensive midfielders exhibit both high offensive and defensive attributes.
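A quick numerical check of such quadrant claims is to average the component scores per primary position. A minimal sketch, with synthetic scores and a shortened position list standing in for `pca_sample_transformed` and `data_sample_with_name['positions']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `positions` and `scores` replace the real
# data_sample_with_name['positions'] and pca_sample_transformed[:, :2].
rng = np.random.default_rng(0)
positions = ["Goalkeeper", "Center Back", "Striker", "Defensive Midfielder"] * 25
scores = rng.normal(size=(100, 2))

df = pd.DataFrame({"position": positions, "PC1": scores[:, 0], "PC2": scores[:, 1]})
centroids = df.groupby("position")[["PC1", "PC2"]].mean()
print(centroids)  # one (PC1, PC2) centroid per position group
```

The sign pattern of each centroid then indicates which quadrant a position group occupies on average.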
pca = PCA(n_components=3)
pca_transformed = pca.fit_transform(data_sample)
fig = go.Figure()
for i, var in enumerate(data_sample.columns):
    fig.add_trace(
        go.Scatter3d(
            x=[0, loadings[i, 0]],
            y=[0, loadings[i, 1]],
            z=[0, loadings[i, 2]],
            mode='lines+text',
            line=dict(color='blue', width=4),
            text=[None, var],
            textposition="top center",
            showlegend=False
        )
    )
fig.update_layout(
    scene=dict(
        xaxis_title=f"PC1 ({pca_sample.explained_variance_ratio_[0]:.2%})",
        yaxis_title=f"PC2 ({pca_sample.explained_variance_ratio_[1]:.2%})",
        zaxis_title=f"PC3 ({pca_sample.explained_variance_ratio_[2]:.2%})",
    ),
    title="Interactive Plot of Factor Loadings",
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig.show()
fig = go.Figure()
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig.add_trace(
            go.Scatter3d(
                x=pca_transformed[position_mask, 0],
                y=pca_transformed[position_mask, 1],
                z=pca_transformed[position_mask, 2],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>PC1: {pc1:.2f}<br>PC2: {pc2:.2f}<br>PC3: {pc3:.2f}"
                      for name, pos, pc1, pc2, pc3 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          pca_transformed[position_mask, 0],
                          pca_transformed[position_mask, 1],
                          pca_transformed[position_mask, 2]
                      )],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
for i, var in enumerate(data_sample.columns):
    fig.add_trace(
        go.Scatter3d(
            x=[0, loadings[i, 0] * 12],
            y=[0, loadings[i, 1] * 12],
            z=[0, loadings[i, 2] * 12],
            mode='lines+text',
            line=dict(color='grey', width=3),
            text=[None, var],
            textposition="top center",
            showlegend=False
        )
    )
fig.update_layout(
    scene=dict(
        xaxis_title="PC1",
        yaxis_title="PC2",
        zaxis_title="PC3",
    ),
    title="Interactive 3D PCA with position filtering",
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig.show()
From the Factor Loadings plot, the following groups of attributes can be identified:
- heading_accuracy, strength
- jumping, aggression, interceptions
- agility, acceleration, balance
- vision, skill_moves, curve
- short_passing, overall_rating, composure, potential
- finishing, penalties
Key positions for each group:
- Defender
- Defender, Forward
- Winger, Forward
- Winger, Midfielder
- Midfielder
- Forward, Winger
Analyzing players on such a plot can help coaches better match the required attributes to positions. Changing positions may not always be necessary; player attributes can also guide tactical decisions. For example, if defenders demonstrate strong offensive skills, a coach might adopt a more aggressive strategy; conversely, if the team's attributes align well with their positions, the tactics can be more balanced.
Players with negative PC1, PC2, and PC3 scores generally exhibit lower skill levels, as these three components together explain about 66% of the variance.
Position distribution on the plot:
- Positive PC1, Negative PC2, Negative PC3: Attacking Midfielder, Right Winger, Left Winger, Center Forward, Striker
- Positive PC1, Negative PC2, Neutral PC3: Central Midfielder, Right Midfielder, Left Midfielder
- Negative PC1, Positive PC2, Negative PC3: Center Back
- Negative PC1, Negative PC2, Negative PC3: Goalkeeper
- Neutral PC1, Neutral PC2, Positive PC3: Right Back, Left Back
- Positive PC1, Positive PC2, Neutral PC3: Defensive Midfielder
Some groups were difficult to classify and were labeled as Neutral, indicating high variability. This suggests that physical and athletic traits are often not strongly correlated with position. However, there were cases where PC3 values were clearly positive or negative, particularly for offensive positions. This is an interesting observation, but it does not imply that players in these positions inherently have poor physical or athletic abilities. Most likely, low PC3 values reflect lower interceptions—a skill typical for defenders.
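One way to back claims like "low PC3 reflects lower interceptions" is to inspect which attributes carry the largest absolute loadings on each component. A minimal sketch on random stand-in data (the column names are illustrative, not the full attribute set):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: a random matrix standing in for the standardized skill data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
cols = ["interceptions", "strength", "dribbling", "vision", "finishing", "agility"]

pca = PCA(n_components=3).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=cols, columns=["PC1", "PC2", "PC3"])

# For each component, list the attributes with the largest absolute loadings
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(3)
    print(pc, list(top.index))
```

On the real `data_sample`, the same ranking would confirm (or refute) which skills dominate PC3.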
Classical Multidimensional Scaling (MDS)¶
Description
Classical MDS is a dimensionality reduction method based on the distance matrix between objects. It works by mapping distances in the original space to distances in a lower-dimensional space.
How it works
- Compute the distance matrix between points in the original space (e.g., Euclidean distance).
- The method seeks a configuration of coordinates in the new space where the distances between points best approximate the original distance matrix.
Advantages
- Simplicity: Easy to understand and apply for data with a small number of dimensions.
- Distance preservation: Faithfully represents data structure based on distances.
- No parameter tuning required: Automatically adapts to the data.
Disadvantages
- Linearity: Assumes that the data can be well-represented in the new space using a linear transformation.
- Sensitivity to noise: Distorted distances can significantly affect the results.
- High computational cost: For large datasets, computing the distance matrix and its decomposition is expensive.
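The two steps above can be sketched directly with NumPy (Torgerson's classical scaling: double-center the squared distance matrix, then eigendecompose), here on random toy data rather than the player sample:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # toy data standing in for data_sample
D = pairwise_distances(X, metric='euclidean')

# Double-center the squared distance matrix: B = -1/2 * J D^2 J, with J = I - 11^T/n
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# Embed using the top eigenvectors, scaled by the square roots of the eigenvalues
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
coords = eigvecs[:, order[:2]] * np.sqrt(eigvals[order[:2]])

# Distances in the 2-D embedding approximate the original distance matrix
D_hat = pairwise_distances(coords, metric='euclidean')
```

This is equivalent to what `sklearn.manifold.MDS` approximates iteratively in the metric case.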
from sklearn.metrics import pairwise_distances, euclidean_distances

distance_matrix_original = pairwise_distances(data_sample, metric='euclidean')
mds = MDS(n_components=2, metric=True, dissimilarity='precomputed', random_state=0)
mds_transformed = mds.fit_transform(distance_matrix_original)
distance_matrix_mds = pairwise_distances(mds_transformed, metric='euclidean')
def stress(original, reduced):
    return np.sqrt(np.sum((original - reduced) ** 2) / np.sum(original ** 2))
stress_mds = stress(distance_matrix_original, distance_matrix_mds)
print("STRESS = " + str(stress_mds))
fig_mds = go.Figure()
data_sample_positions = data_sample_with_name['positions'][:len(mds_transformed)]
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_positions]
    if any(position_mask):
        fig_mds.add_trace(
            go.Scatter(
                x=mds_transformed[position_mask, 0],
                y=mds_transformed[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}"
                      for name, pos, dim1, dim2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed[position_mask, 0],
                          mds_transformed[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
fig_mds.update_layout(
    title="Interactive Classical MDS plot with position filtering",
    xaxis_title="Dim1",
    yaxis_title="Dim2",
    showlegend=True,
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
    xaxis=dict(
        range=[mds_transformed[:, 0].min(), mds_transformed[:, 0].max()],
        scaleanchor="y"
    ),
    yaxis=dict(
        range=[mds_transformed[:, 1].min(), mds_transformed[:, 1].max()]
    )
)
fig_mds.show()
STRESS = 0.21809578912205843
The STRESS coefficient exceeds 0.2, so by Kruskal's guidelines the fit in two dimensions is poor.
The groups look similar to those in the two-dimensional PCA plot.
Unfortunately, MDS does not produce loadings, so it does not allow a detailed analysis of how individual variables influence specific groups.
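A common workaround for the missing loadings is to correlate each original variable with the MDS axes, which gives a rough, loading-like picture of variable influence. A sketch on random stand-in data (not the player sample):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the standardized skill matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))

D = pairwise_distances(X, metric='euclidean')
emb = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(D)

# Correlate each original variable with each embedding axis to gauge influence
corr = np.array([[np.corrcoef(X[:, j], emb[:, k])[0, 1] for k in range(2)]
                 for j in range(X.shape[1])])
print(corr.round(2))  # rows: original variables, columns: Dim1/Dim2
```

Variables with large absolute correlations drive a given axis, analogously to PCA loadings.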
mds_3d = MDS(n_components=3, metric=True, dissimilarity='precomputed', random_state=0)
mds_transformed_3d = mds_3d.fit_transform(distance_matrix_original)
distance_matrix_mds_3d = euclidean_distances(mds_transformed_3d)
stress_mds_3d = stress(distance_matrix_original, distance_matrix_mds_3d)
print("STRESS = " + str(stress_mds_3d))
fig = go.Figure()
data_sample_positions = data_sample_with_name['positions'][:len(mds_transformed_3d)]
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_positions]
    if any(position_mask):
        fig.add_trace(
            go.Scatter3d(
                x=mds_transformed_3d[position_mask, 0],
                y=mds_transformed_3d[position_mask, 1],
                z=mds_transformed_3d[position_mask, 2],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}<br>Dim3: {dim3:.2f}"
                      for name, pos, dim1, dim2, dim3 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed_3d[position_mask, 0],
                          mds_transformed_3d[position_mask, 1],
                          mds_transformed_3d[position_mask, 2])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
fig.update_layout(
    scene=dict(
        xaxis_title="Dim1",
        yaxis_title="Dim2",
        zaxis_title="Dim3",
    ),
    title="Interactive 3D Classical MDS plot with position filtering",
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig.show()
STRESS = 0.14896369375640137
The STRESS coefficient is about 0.15, so the fit in three dimensions is still weak, falling between "fair" and "poor".
The groups appear to be more spread out compared to PCA. Players in specific positions seem less related to each other.
distance_matrix_original = pairwise_distances(data_sample, metric='euclidean')
def stress(original, reduced):
    return np.sqrt(np.sum((original - reduced) ** 2) / np.sum(original ** 2))
stress_values = {}
# Loop-local names, so the 2-D solution in `mds_transformed` is not overwritten
for dims in range(1, 11):
    mds_k = MDS(n_components=dims, metric=True, dissimilarity='precomputed', random_state=0)
    mds_transformed_k = mds_k.fit_transform(distance_matrix_original)
    distance_matrix_mds_k = euclidean_distances(mds_transformed_k)
    stress_mds_k = stress(distance_matrix_original, distance_matrix_mds_k)
    stress_values[dims] = stress_mds_k
    print(f"STRESS: {dims} Dimensions: {stress_mds_k:.4f}")
STRESS: 1 Dimensions: 0.4982
STRESS: 2 Dimensions: 0.2181
STRESS: 3 Dimensions: 0.1490
STRESS: 4 Dimensions: 0.1031
STRESS: 5 Dimensions: 0.0815
STRESS: 6 Dimensions: 0.0690
STRESS: 7 Dimensions: 0.0589
STRESS: 8 Dimensions: 0.0506
STRESS: 9 Dimensions: 0.0461
STRESS: 10 Dimensions: 0.0426
With three dimensions, the STRESS coefficient is 0.15, which means the fit is poor but acceptable. A good fit (STRESS around 0.05 or below) is reached only at about eight dimensions.
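These judgments follow Kruskal's commonly cited rule of thumb (roughly: 0.2 poor, 0.1 fair, 0.05 good, 0.025 excellent). A small helper that labels the STRESS values printed above:

```python
# Kruskal's rough guidelines for interpreting the STRESS coefficient
def kruskal_fit(stress_value):
    if stress_value >= 0.20:
        return "poor"
    if stress_value >= 0.10:
        return "fair"
    if stress_value >= 0.05:
        return "good"
    if stress_value >= 0.025:
        return "excellent"
    return "near-perfect"

# Applied to the classical MDS STRESS values reported above
for dims, s in [(2, 0.2181), (3, 0.1490), (8, 0.0506), (10, 0.0426)]:
    print(f"{dims} dims: STRESS {s} -> {kruskal_fit(s)}")
```

These thresholds are heuristics, not hard cutoffs, and the exact boundaries vary between sources.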
Non-metric Multidimensional Scaling (NMDS)¶
Description
NMDS is a dimensionality reduction method that maps the order (ranks) of distances, rather than the distances themselves, into a new space. It is used when the actual distances between points are unknown or unreliable.
Principle of Operation
- The ranks of distances between points in the original space are calculated.
- NMDS attempts to map these ranks into a lower-dimensional space by minimizing the STRESS function.
Advantages
- Flexibility: Can work with any similarity or distance measures (even non-standard ones).
- No linearity assumptions: Handles non-linear data structures better.
- Usefulness with non-metric data: Ideal for analyses with ordinal or subjective data.
Disadvantages
- Lack of unique solution: The result depends on the starting point and optimization.
- More difficult interpretation: Distances in the reduced space are not always intuitive to understand.
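The rank-preservation idea can be illustrated on toy data: after fitting a non-metric MDS, the Spearman rank correlation between original and embedded distances should be high even when the raw magnitudes differ. A sketch on random data, not the player sample:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Toy data standing in for the standardized skill matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
D = pairwise_distances(X, metric='euclidean')

nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           n_init=4, max_iter=300, random_state=0)
emb = nmds.fit_transform(D)
D_hat = pairwise_distances(emb, metric='euclidean')

# NMDS preserves the *order* of distances, so the rank correlation between
# original and embedded distances should be high even if magnitudes differ
iu = np.triu_indices_from(D, k=1)
rho, _ = spearmanr(D[iu], D_hat[iu])
print(round(rho, 3))
```

This is why NMDS quality is judged by monotonic (rank) agreement rather than by raw distance errors.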
distance_matrix_original = pairwise_distances(data_sample, metric='euclidean')
mds_nonmetric = MDS(
    n_components=2,
    max_iter=1000,
    n_init=10,
    metric=False,
    dissimilarity='precomputed',
    random_state=0
)
mds_transformed_nonmetric = mds_nonmetric.fit_transform(distance_matrix_original)
# Canberra distances in the embedding, compared against the Euclidean originals
distance_matrix_nonmetric = pairwise_distances(mds_transformed_nonmetric, metric='canberra')
def stress(original, reduced):
    return np.sqrt(np.sum((original - reduced) ** 2) / np.sum(original ** 2))
stress_nonmetric = stress(distance_matrix_original, distance_matrix_nonmetric)
print("STRESS = " + str(stress_nonmetric))
fig_nonmetric = go.Figure()
data_sample_positions = data_sample_with_name['positions'][:len(mds_transformed_nonmetric)]
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_positions]
    if any(position_mask):
        fig_nonmetric.add_trace(
            go.Scatter(
                x=mds_transformed_nonmetric[position_mask, 0],
                y=mds_transformed_nonmetric[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}"
                      for name, pos, dim1, dim2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed_nonmetric[position_mask, 0],
                          mds_transformed_nonmetric[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
fig_nonmetric.update_layout(
    title="Interactive Non-metric MDS plot with position filtering",
    xaxis_title="Dim1",
    yaxis_title="Dim2",
    showlegend=True,
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
    xaxis=dict(
        range=[mds_transformed_nonmetric[:, 0].min(), mds_transformed_nonmetric[:, 0].max()],
        scaleanchor="y"
    ),
    yaxis=dict(
        range=[mds_transformed_nonmetric[:, 1].min(), mds_transformed_nonmetric[:, 1].max()]
    )
)
fig_nonmetric.show()
STRESS = 0.8388173683085303
The STRESS coefficient is very high (0.8), indicating that Non-metric Multidimensional Scaling is not suitable for analyzing this dataset. The plot also shows that the distribution of positions looks random.
distance_matrix_original = pairwise_distances(data_sample, metric='euclidean')
mds_nonmetric_3d = MDS(
    n_components=3,
    max_iter=1000,
    n_init=10,
    metric=False,
    dissimilarity='precomputed',
    random_state=0
)
mds_transformed_nonmetric_3d = mds_nonmetric_3d.fit_transform(distance_matrix_original)
distance_matrix_nonmetric = pairwise_distances(mds_transformed_nonmetric_3d, metric='canberra')
def stress(original, reduced):
    return np.sqrt(np.sum((original - reduced) ** 2) / np.sum(original ** 2))
stress_nonmetric = stress(distance_matrix_original, distance_matrix_nonmetric)
print("STRESS = " + str(stress_nonmetric))
fig_nonmetric_3d = go.Figure()
data_sample_positions = data_sample_with_name['positions'][:len(mds_transformed_nonmetric_3d)]
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_positions]
    if any(position_mask):
        fig_nonmetric_3d.add_trace(
            go.Scatter3d(
                x=mds_transformed_nonmetric_3d[position_mask, 0],
                y=mds_transformed_nonmetric_3d[position_mask, 1],
                z=mds_transformed_nonmetric_3d[position_mask, 2],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}<br>Dim3: {dim3:.2f}"
                      for name, pos, dim1, dim2, dim3 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed_nonmetric_3d[position_mask, 0],
                          mds_transformed_nonmetric_3d[position_mask, 1],
                          mds_transformed_nonmetric_3d[position_mask, 2])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            )
        )
fig_nonmetric_3d.update_layout(
    title="Interactive 3D Non-metric MDS plot with position filtering",
    scene=dict(
        xaxis_title="Dim1",
        yaxis_title="Dim2",
        zaxis_title="Dim3"
    ),
    showlegend=True,
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig_nonmetric_3d.show()
STRESS = 0.7604378737663599
The STRESS coefficient decreased by about 0.08 compared to two dimensions (0.84 → 0.76), but it is still very high, leading to the same conclusions as in the two-dimensional case. The plot still looks very chaotic and is not useful for analysis.
distance_matrix_original = pairwise_distances(data_sample, metric='euclidean')
def stress(original, reduced):
    return np.sqrt(np.sum((original - reduced) ** 2) / np.sum(original ** 2))
stress_values = {}
# Loop-local names, so the 2-D solution in `mds_transformed_nonmetric` is not overwritten
for dims in range(1, 11):
    mds_nonmetric_k = MDS(
        n_components=dims,
        max_iter=1000,
        n_init=10,
        metric=False,
        dissimilarity='precomputed',
        random_state=0
    )
    mds_transformed_nonmetric_k = mds_nonmetric_k.fit_transform(distance_matrix_original)
    distance_matrix_nonmetric_k = pairwise_distances(mds_transformed_nonmetric_k, metric='canberra')
    stress_nonmetric_k = stress(distance_matrix_original, distance_matrix_nonmetric_k)
    stress_values[dims] = stress_nonmetric_k
    print(f"STRESS: {dims} Dimensions: {stress_nonmetric_k:.4f}")
STRESS: 1 Dimensions: 0.9179
STRESS: 2 Dimensions: 0.8388
STRESS: 3 Dimensions: 0.7604
STRESS: 4 Dimensions: 0.6823
STRESS: 5 Dimensions: 0.6092
STRESS: 6 Dimensions: 0.5380
STRESS: 7 Dimensions: 0.4714
STRESS: 8 Dimensions: 0.4080
STRESS: 9 Dimensions: 0.3582
STRESS: 10 Dimensions: 0.3257
The analysis of the STRESS coefficient for successive dimensions still confirms that Non-metric Multidimensional Scaling is not suitable for dimensionality reduction for this data.
fig_combined = make_subplots(
    rows=1, cols=3,
    subplot_titles=["PCA", "Classical MDS", "Non-metric MDS"],
    shared_yaxes=False,
    shared_xaxes=False
)
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig_combined.add_trace(
            go.Scatter(
                x=pca_sample_transformed[position_mask, 0],
                y=pca_sample_transformed[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>PC1: {pc1:.2f}<br>PC2: {pc2:.2f}"
                      for name, pos, pc1, pc2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          pca_sample_transformed[position_mask, 0],
                          pca_sample_transformed[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=True
            ),
            row=1, col=1
        )
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig_combined.add_trace(
            go.Scatter(
                x=mds_transformed[position_mask, 0],
                y=mds_transformed[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}"
                      for name, pos, dim1, dim2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed[position_mask, 0],
                          mds_transformed[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=False
            ),
            row=1, col=2
        )
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig_combined.add_trace(
            go.Scatter(
                x=mds_transformed_nonmetric[position_mask, 0],
                y=mds_transformed_nonmetric[position_mask, 1],
                mode='markers',
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.7
                ),
                text=[f"{name}<br>Positions: {pos}<br>Dim1: {dim1:.2f}<br>Dim2: {dim2:.2f}"
                      for name, pos, dim1, dim2 in zip(
                          data_sample_with_name['name'][position_mask],
                          data_sample_with_name['positions'][position_mask],
                          mds_transformed_nonmetric[position_mask, 0],
                          mds_transformed_nonmetric[position_mask, 1])],
                hoverinfo='text',
                name=position,
                legendgroup=position,
                showlegend=False
            ),
            row=1, col=3
        )
fig_combined.update_layout(
    legend_title="Positions",
    showlegend=True,
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig_combined.show()
fig = make_subplots(
    rows=1, cols=3,
    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}, {'type': 'scatter3d'}]],
    subplot_titles=["PCA", "Classical MDS", "Non-metric MDS"]
)
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig.add_trace(
            go.Scatter3d(
                x=pca_transformed[position_mask, 0],
                y=pca_transformed[position_mask, 1],
                z=pca_transformed[position_mask, 2],
                mode='markers',
                marker=dict(size=5, color=color, opacity=0.7),
                name=position,
                legendgroup=position,
                showlegend=True
            ),
            row=1, col=1
        )
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig.add_trace(
            go.Scatter3d(
                x=mds_transformed_3d[position_mask, 0],
                y=mds_transformed_3d[position_mask, 1],
                z=mds_transformed_3d[position_mask, 2],
                mode='markers',
                marker=dict(size=5, color=color, opacity=0.7),
                name=position,
                legendgroup=position,
                showlegend=False
            ),
            row=1, col=2
        )
for position, color in position_colors.items():
    position_mask = [position in pos for pos in data_sample_with_name['positions']]
    if any(position_mask):
        fig.add_trace(
            go.Scatter3d(
                x=mds_transformed_nonmetric_3d[position_mask, 0],
                y=mds_transformed_nonmetric_3d[position_mask, 1],
                z=mds_transformed_nonmetric_3d[position_mask, 2],
                mode='markers',
                marker=dict(size=5, color=color, opacity=0.7),
                name=position,
                legendgroup=position,
                showlegend=False
            ),
            row=1, col=3
        )
fig.update_layout(
    legend=dict(x=1.05, y=1, title="Positions"),
    autosize=True,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig.show()
Comparison of Dimensionality Reduction Methods based on Plots¶
Based on the visualizations of the results obtained from PCA, MDS, and NMDS, differences in the ability of each method to represent data structure and grouping can be observed.
PCA (Principal Component Analysis)¶
Grouping: PCA effectively represents the data structure, grouping objects consistently with their positions.
Clarity: Visualizations for the first two or three principal components allow for clear distinction of groups. This confirms that PCA effectively utilizes information about data variance.
Data Utilization: PCA allows for dimensionality reduction with minimal information loss, making the groups clear and intuitive.
MDS (Multidimensional Scaling)¶
Grouping: MDS also shows good grouping abilities, and in some cases surpasses PCA, especially concerning objects with specific characteristics.
Dimension Adjustment: Visualizations indicate that MDS performs well even with a low number of dimensions. Decreasing STRESS values confirm that the method effectively represents the data structure.
Precision: Compared to PCA, MDS sometimes better differentiates objects with subtle differences, which can be beneficial in more complex analyses.
NMDS (Non-Metric Multidimensional Scaling)¶
Grouping: NMDS was not effective in analyzing this data. On the plots, points are scattered chaotically, with no clear grouping.
Fit Issues: High STRESS values (even with a higher number of dimensions) indicate that NMDS struggles to represent the data structure non-linearly.
Practicality: The NMDS method may be more suitable for data with unclear structures or when metric methods cannot be used, but in this case, it was the least effective.
Conclusions and Recommendations¶
PCA and MDS show similar grouping properties and represent the data well. PCA is more effective when explaining variance is key, while MDS can be more precise in analyzing subtle differences between objects.
NMDS was not effective in this data analysis – the chaos in the plots and lack of clear groups suggest that the method is not suitable for this specific analysis.
The choice between PCA and MDS should depend on the specific problem:
PCA works well for general analyses where understanding the variance structure is crucial.
MDS may be better for more complex data dependencies.
Practical Recommendations¶
For visualizing and grouping players based on their positions, PCA seems to be the best choice as it provides clarity and minimal information loss.
MDS can be used complementarily to PCA if additional details regarding differences between objects are desired.
NMDS is worth considering in other types of analyses, but for this dataset, its application was not effective.
pca_data = {
    "Dimensions": range(1, len(explained_variance_ratio_sample) + 1),
    "PCA Component": [f"PC{i+1}" for i in range(len(explained_variance_ratio_sample))],
    "Explained Variance PCA": explained_variance_ratio_sample,
    "Cumulative Variance PCA": cumulative_variance_sample,
}
# STRESS values for classical MDS, copied from the dimension loop above
# (the NMDS loop reuses `stress_values`, so these are kept explicitly)
stress_mds = [
    0.4982, 0.2181, 0.1490, 0.1031, 0.0815, 0.0690, 0.0589, 0.0506, 0.0461, 0.0426
]
stress_nmds = [stress_values[dims] for dims in range(1, 11)]
combined_data = {
    "Dimensions": pca_data["Dimensions"][:10],
    "PCA Component": pca_data["PCA Component"][:10],
    "Explained Variance PCA": pca_data["Explained Variance PCA"][:10],
    "Cumulative Variance PCA": pca_data["Cumulative Variance PCA"][:10],
    "STRESS MDS": stress_mds[:10],
    "STRESS NMDS": stress_nmds[:10],
}
df = pd.DataFrame(combined_data)
df['Explained Variance PCA'] = df['Explained Variance PCA'].round(4)
df['Cumulative Variance PCA'] = df['Cumulative Variance PCA'].round(4)
df['STRESS NMDS'] = df['STRESS NMDS'].round(4)
print(df)
   Dimensions PCA Component  Explained Variance PCA  Cumulative Variance PCA  STRESS MDS  STRESS NMDS
0           1           PC1                  0.3739                   0.3739      0.4982       0.9179
1           2           PC2                  0.2012                   0.5751      0.2181       0.8388
2           3           PC3                  0.0859                   0.6610      0.1490       0.7604
3           4           PC4                  0.0569                   0.7179      0.1031       0.6823
4           5           PC5                  0.0347                   0.7526      0.0815       0.6092
5           6           PC6                  0.0292                   0.7817      0.0690       0.5380
6           7           PC7                  0.0242                   0.8060      0.0589       0.4714
7           8           PC8                  0.0179                   0.8238      0.0506       0.4080
8           9           PC9                  0.0153                   0.8391      0.0461       0.3582
9          10          PC10                  0.0148                   0.8539      0.0426       0.3257
PCA¶
The table lists the first 10 principal components; each successive component explains a decreasing portion of the data variance.
The first component (PC1) explains the largest portion of variance – 37.4%.
Cumulative variance reaches 85.4% with 10 components, indicating that the data is well explained by these components.
For 4 components (PC1–PC4), cumulative variance is 71.8%, suggesting that the first 4 components may be sufficient to describe most of the information.
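The component count for any variance target can also be read off programmatically; a small sketch using the cumulative-variance values from the table above:

```python
import numpy as np

# Cumulative explained-variance values from the table above (first 10 PCs)
cumulative = np.array([0.3739, 0.5751, 0.6610, 0.7179, 0.7526,
                       0.7817, 0.8060, 0.8238, 0.8391, 0.8539])

# Smallest number of components whose cumulative variance reaches 80%
n_80 = int(np.argmax(cumulative >= 0.80) + 1)
print(n_80)  # -> 7
```

In practice the same result comes from `PCA(n_components=0.80)` in scikit-learn, which selects the component count for a variance fraction directly.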
STRESS for MDS¶
The STRESS value decreases with an increasing number of dimensions, indicating a better model fit.
For 1 dimension, STRESS is 0.4982, indicating a poor data representation.
For 4 dimensions, STRESS drops to 0.1031, indicating a moderate model fit.
For 10 dimensions, STRESS reaches 0.0426, suggesting that further increasing dimensionality may not significantly improve the fit.
STRESS for NMDS¶
STRESS values for NMDS are significantly higher compared to MDS, indicating a more difficult model fit with this approach.
For 1 dimension, STRESS is as high as 0.9179, indicating a very poor data representation.
For 4 dimensions, STRESS decreases to 0.6823, which is still a very poor result.
Even for 10 dimensions, STRESS is 0.3257, meaning that NMDS in this case has problems accurately representing the data.
Overall Comparison¶
PCA achieves very good results – even the first few components explain a significant portion of variance, making this method effective for data dimensionality reduction.
MDS represents data better than NMDS, especially with fewer dimensions.
NMDS can be useful in specific cases but requires a larger number of dimensions to achieve decent results with this dataset.
Conclusions¶
If the goal is to explain data variance, PCA is the most effective tool.
Classical MDS consistently outperformed NMDS on this dataset, particularly at low dimensionality.
NMDS may be useful in specific cases, but here its application was not effective.
Project Summary¶
The project aimed to analyze and visualize FIFA player skills, focusing exclusively on their technical, on-field, and mental attributes. Personal, physical, and economic data were deliberately omitted to obtain an objective picture of football skills. The main dimensionality reduction methods applied were PCA, classical MDS, and non-metric MDS.
Analysis Results¶
PCA¶
Reducing dimensions to 7 allowed retaining over 80% of variance.
PC1 and PC2 best represent players' offensive and defensive skills:
PC1 (35.54% variance): Focuses on offensive skills like dribbling, vision, and shooting.
PC2 (18.63% variance): Relates to defensive skills such as strength and ball interception.
The analysis showed consistency between skills and player positions:
Offensive positions (e.g., strikers) are characterized by high PC1 values.
Defensive positions (e.g., defenders) have high PC2 values.
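The "7 components retain over 80% of variance" choice need not be read off the table by hand: scikit-learn's `PCA` accepts a float `n_components` and keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch with stand-in correlated data (the real notebook would pass the scaled skill matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Stand-in skill columns built from a few latent factors plus noise,
# so that a handful of components dominate, as in the FIFA data.
base = rng.normal(size=(500, 4))
skills = np.hstack([base + 0.3 * rng.normal(size=(500, 4)) for _ in range(3)])
X = StandardScaler().fit_transform(skills)

# A float n_components keeps the smallest k whose cumulative
# explained variance reaches that fraction.
pca = PCA(n_components=0.80).fit(X)
k = pca.n_components_   # number of components kept
```

This makes the variance threshold an explicit, reproducible parameter of the pipeline rather than a manually chosen cutoff.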
MDS¶
Classical MDS captured the data structure, although the STRESS values in two and three dimensions (0.2181 and 0.1490) indicate only a moderate fit.
Player groups are less distinct than in PCA, and the structure requires a higher number of dimensions for more accurate analysis.
NMDS¶
NMDS is not suitable for analyzing this dataset: its STRESS remains very high (0.8388 even in two dimensions).
The plots show a chaotic distribution that is not useful for drawing conclusions.
Method Comparison¶
PCA: Best represents data structure with minimal information loss. Crucial for visualizing and identifying skill groups.
MDS: Can better differentiate subtle differences in data but requires a higher number of dimensions for a better fit.
NMDS: Inappropriate for this dataset due to issues with result interpretation.
Practical Applications¶
Position Matching: The analysis helps identify players whose skills may better suit alternative positions.
Development Planning: Results can support coaches in guiding training, e.g., developing traits crucial for a desired role on the field.
Tactical Optimization: Understanding players' skill profiles allows for adjusting game strategy, e.g., aggressive offense or balanced defense.
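The position-matching idea above can be sketched as a nearest-neighbour search in the reduced PCA space: players whose compact skill profiles lie close to a given player are candidates for the same (or an alternative) role. Stand-in data is used in place of the real skill matrix, and all variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))            # stand-in scaled skill matrix
Z = PCA(n_components=4).fit_transform(X)  # compact skill profiles

# The 5 players most similar to player 0 in PCA space; the closest
# neighbour is the player itself, so it is dropped.
nn = NearestNeighbors(n_neighbors=6).fit(Z)
_, idx = nn.kneighbors(Z[[0]])
similar_players = idx[0][1:]
```

Searching in the 4-dimensional PCA space rather than the full skill space suppresses noise from the low-variance components while keeping most of the profile information.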
Conclusions¶
The project confirmed that PCA is the best dimensionality reduction method for FIFA data, enabling the analysis of player skills. MDS and NMDS showed limited usefulness compared to PCA. The analysis highlights the potential value of data in optimizing football team setups and strategies.